Abstract

Many clustering algorithms exist that estimate a cluster centroid, such as K-means, K-medoids or
mean-shift, but no algorithm seems to exist that clusters data by returning exactly K meaningful modes.
We propose a natural definition of a K-modes objective function by combining the notions of density
and cluster assignment. The algorithm becomes K-means and K-medoids in the limit of very large and
very small scales. Computationally, it is slightly slower than K-means but much faster than mean-shift
or K-medoids. Unlike K-means, it is able to find centroids that are valid patterns, truly representative
of a cluster, even with nonconvex clusters, and appears robust to outliers and misspecification of the
scale and number of clusters.

Given a dataset $x_1, \dots, x_N \in \mathbb{R}^D$, we consider clustering algorithms based on centroids, i.e., that estimate
a representative $c_k \in \mathbb{R}^D$ of each cluster k in addition to assigning data points to clusters. Two of the most
widely used algorithms of this type are K-means and mean-shift. K-means has the number of clusters K as
a user parameter and tries to minimize the objective function

$$\min_{R,C} E(R,C) = \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk} \|x_n - c_k\|^2 \qquad (1)$$

$$\text{s.t.} \quad r_{nk} \in \{0,1\}, \quad \sum_{k=1}^{K} r_{nk} = 1, \quad n = 1,\dots,N, \quad k = 1,\dots,K$$


where R are binary assignment variables (of point n to cluster k) and $C = (c_1, \dots, c_K)$ are centroids, free
to move in $\mathbb{R}^D$. At an optimum, centroid $c_k$ is the mean of the points in its cluster. Gaussian
mean-shift Fukunaga and Hostetler (1975); Cheng (1995); Carreira-Perpiñán (2000); Comaniciu and Meer (2002)
assumes we have a kernel density estimate (kde) with bandwidth $\sigma > 0$ and kernel $G(t) \propto e^{-t/2}$

$$p(x) = \frac{1}{N} \sum_{n=1}^{N} G\left(\|(x - x_n)/\sigma\|^2\right) \qquad (2)$$

and applies the iteration (started from each data point): 

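$$p(n|x) = \frac{\exp\left(-\frac{1}{2}\|(x - x_n)/\sigma\|^2\right)}{\sum_{n'=1}^{N} \exp\left(-\frac{1}{2}\|(x - x_{n'})/\sigma\|^2\right)}, \qquad x \leftarrow f(x) = \sum_{n=1}^{N} p(n|x)\, x_n \qquad (3)$$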



which converges to a mode (local maximum) of p from nearly any initial x Carreira-Perpiñán (2007). Each
mode is the centroid for one cluster, which contains all the points that converge to its mode. The user
parameter is the bandwidth $\sigma$ and the resulting number of clusters depends on it implicitly.
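
The iteration of eq. (3) can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' implementation: the function names, the stopping tolerance `tol` and the toy dataset are assumptions made for this example, and points are represented as tuples of floats.

```python
import math

def mean_shift_step(x, data, sigma):
    """One Gaussian mean-shift update x <- f(x), as in eq. (3)."""
    # Unnormalized posteriors p(n|x), proportional to exp(-0.5 ||(x - x_n)/sigma||^2).
    weights = [
        math.exp(-0.5 * sum((xi - pi) ** 2 for xi, pi in zip(x, p)) / sigma ** 2)
        for p in data
    ]
    total = sum(weights)  # may underflow to 0 if x is very far from all data
    return tuple(
        sum(w * p[d] for w, p in zip(weights, data)) / total
        for d in range(len(x))
    )

def mean_shift(x, data, sigma, tol=1e-9, max_iter=1000):
    """Iterate eq. (3) from x until the update moves less than tol."""
    for _ in range(max_iter):
        new_x = mean_shift_step(x, data, sigma)
        if math.dist(new_x, x) < tol:
            return new_x
        x = new_x
    return x

# Toy 1D dataset (hypothetical): two tight groups around 0 and 5.
data = [(0.0,), (0.1,), (-0.1,), (5.0,), (5.1,), (4.9,)]
mode_left = mean_shift((0.1,), data, sigma=0.5)    # small bandwidth: one mode per group
mode_right = mean_shift((4.9,), data, sigma=0.5)
mode_single = mean_shift((0.1,), data, sigma=10.0)  # large bandwidth: a single mode
```

With the small bandwidth each group yields its own mode, while the large bandwidth merges them into one mode between the groups, which illustrates how the number of clusters depends implicitly on the bandwidth.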

The pros and cons of both algorithms are well known. K-means tends to define round clusters; mean-shift
can obtain clusters of arbitrary shapes and has been very popular in low-dimensional clustering applications
such as image segmentation Comaniciu and Meer (2002), but does not work well in high dimension. Both
can be seen as special EM algorithms Bishop (2006); Carreira-Perpiñán (2007). Both suffer from outliers,
which can move centroids outside their cluster in K-means or create singleton modes in mean-shift.
Computationally, K-means is much faster than mean-shift, at $O(KND)$ and $O(N^2 D)$ per iteration, respectively,
particularly with large datasets. In fact, accelerating mean-shift has been a topic of active research
Carreira-Perpiñán (2006); Yuan et al. (2010). Mean-shift does not require a value of K, which is sometimes convenient,
although many users find it desirable to force an algorithm to produce exactly K clusters (e.g. if prior
information is available).

One important aspect in many applications concerns the validity of the centroids as patterns in the input 
space, as well as how representative they are of their cluster. Fig. 1 illustrates this with a single cluster 
consisting of continuously rotated digit-1 images. Since these images represent a nonconvex cluster in the 
high-dimensional pixel space, their mean (which averages all orientations) is not a valid digit-1 image, which 
makes the centroid not interpretable and hardly representative of a digit 1. Mean-shift does not work well 
either: to produce a single mode, a large bandwidth is required, which makes the mode lie far from the 
manifold; a smaller bandwidth does produce valid digit-1 images, but then multiple modes arise for the same 
cluster, and under mean-shift each of them defines a separate cluster. Clustering applications that require valid centroids
for nonconvex or manifold data abound (e.g. images, shapes or proteins). 

A third type of centroid-based algorithm is exemplar-based or K-medoid clustering Kaufman and
Rousseeuw (1990); Bishop (2006); Hastie et al. (2009). These constrain the centroids to be points from the
dataset ("exemplars"), such as K-medians, and often minimize a K-means objective function (1) with a
non-Euclidean distance. They are slow, since updating centroid $c_k$ requires testing all pairs of points in
cluster k. Forcing the centroids to be exemplars is often regarded as a way to ensure the centroids are
valid patterns. However, the exemplars themselves are often noisy and thus not that representative of their 
neighborhood. Not constraining a centroid to be an exemplar can remove such noise and produce a more 
typical representative. 

Given that most location statistics have been used for clustering (mean, mode, median), it is remarkable
that no K-modes formulation for clustering seems to exist, that is, an algorithm that will find exactly K
modes that correspond to meaningful clusters. An obvious way to define a K-modes algorithm is to pick
K modes from a kde, but it is not clear which modes to pick (assuming it has at least K modes, which
will require a sufficiently small bandwidth). Picking the modes with highest density need not correlate well
with clusters that have an irregular density, or an approximately uniform density with close but distinct 
high-density modes. 

We define a K-modes objective as a natural combination of two ideas: the cluster assignment idea from
K-means and the density maximization idea of mean-shift. The algorithm has two interesting special cases,
K-means and a version of K-medoids, in the limits of large and small bandwidth, respectively. For small
enough bandwidth, the centroids are denoised, valid patterns and typical representatives of their cluster.
Computationally, it is slightly slower than K-means but much faster than mean-shift or K-medoids.

Figure 1: A cluster of 7 rotated-1 USPS digit images and the centroids found by K-means, K-modes (both
with K = 1) and mean-shift (with $\sigma$ so there is one mode).

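1 A K-modes Objective Function

We maximize the objective function

$$\max_{R,C} L(R,C) = \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk}\, G\left(\|(x_n - c_k)/\sigma\|^2\right) \qquad (4)$$

$$\text{s.t.} \quad r_{nk} \in \{0,1\}, \quad \sum_{k=1}^{K} r_{nk} = 1, \quad n = 1,\dots,N, \quad k = 1,\dots,K.$$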

For a given assignment R, this can be seen as (proportional to) the sum of a kde as in eq. (2) but separately 
for each cluster. Thus, a good clustering must move centroids to local modes, but also define K separate 
kdes. This naturally combines the idea of clustering through binary assignment variables with the idea that 
high-density points are representative of a cluster (for suitable bandwidth values). 

As a function of the bandwidth $\sigma$, the K-modes objective function has two interesting limit cases. When
$\sigma \to \infty$, it becomes K-means. This can be seen from the centroid update (which becomes the mean), or
from the objective function directly. Indeed, approximating it with Taylor's theorem for very large $\sigma$
(taking $G(t) = e^{-t/2} \approx 1 - t/2$) and using the fact that $\sum_{k=1}^{K} r_{nk} = 1$ gives

$$L(R,C) \approx \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk} \left(1 - \frac{1}{2}\|(x_n - c_k)/\sigma\|^2\right) = N - \frac{E(R,C)}{2\sigma^2}$$

where E(R, C) is the same as in eq. (1) and is subject to the same constraints. Thus, maximizing L
becomes minimizing E, exactly the K-means problem. When $\sigma \to 0$, it becomes a K-medoids algorithm,
since the centroids are driven towards data points. Thus, K-modes interpolates smoothly between these two
algorithms, creating a continuous path that links a K-mean to a K-medoid. However, its most interesting
behavior is for intermediate $\sigma$.
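
The large-bandwidth limit can be checked numerically. The sketch below is an illustration only: it assumes the kernel is normalized as G(t) = exp(-t/2), and the dataset, centroids and assignments are made up for the example. It evaluates the K-modes objective of eq. (4) and compares it with N - E/(2 sigma^2), where E is the K-means objective of eq. (1).

```python
import math

def kmeans_objective(data, centroids, assign):
    """E(R, C) of eq. (1); assign[n] is the cluster index of point n."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, centroids[k]))
        for x, k in zip(data, assign)
    )

def kmodes_objective(data, centroids, assign, sigma):
    """L(R, C) of eq. (4), assuming the kernel G(t) = exp(-t/2)."""
    return sum(
        math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, centroids[k])) / sigma ** 2)
        for x, k in zip(data, assign)
    )

# Hypothetical toy configuration: 5 points in 2D, 2 clusters.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (6.0, 6.0), (7.0, 6.0)]
centroids = [(0.3, 0.3), (6.5, 6.0)]
assign = [0, 0, 0, 1, 1]

sigma = 100.0  # "very large" relative to the data scale
exact = kmodes_objective(data, centroids, assign, sigma)
approx = len(data) - kmeans_objective(data, centroids, assign) / (2 * sigma ** 2)
gap = abs(exact - approx)  # second-order Taylor remainder, shrinks like 1/sigma^4
```

Since N and sigma are constants, the approximation makes explicit that maximizing L over (R, C) at large bandwidth amounts to minimizing E.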

2 Two K-modes Algorithms

As is the case for K-means and K-medoids, maximizing the K-modes objective function is NP-hard. We
focus on iterative algorithms that find a locally optimum clustering in the sense that no improvement is
possible on the centroids given the current assignments, and vice versa. We give first an algorithm for fixed
$\sigma$ and then use it to construct a homotopy algorithm that sweeps over a $\sigma$ interval.

For Fixed $\sigma$

It is convenient to use alternating optimization: 

Assignment step: Over assignments R for fixed C, the constrained problem separates into a constrained
problem for each point $x_n$, of the form

$$\max_{r_{n1},\dots,r_{nK}} \sum_{k=1}^{K} r_{nk}\, g_{nk} \quad \text{s.t.} \quad \sum_{k=1}^{K} r_{nk} = 1, \quad r_{nk} \in \{0,1\}, \quad k = 1,\dots,K,$$

with $g_{nk} = G\left(\|(x_n - c_k)/\sigma\|^2\right)$. The solution is given by assigning point $x_n$ to its closest centroid in
Euclidean distance (assuming the kernel G is a decreasing function of the Euclidean distance).



Mode-finding step: Over centroids C for fixed R, we have a separate unconstrained maximization for each
centroid, of the form

$$L(c_k) = \sum_{n=1}^{N} r_{nk}\, G\left(\|(x_n - c_k)/\sigma\|^2\right),$$

which is proportional to the cluster kde, and can be done with mean-shift. Note the step over C need
not be exact, i.e., the centroids need not converge to their corresponding modes. We exit when a
tolerance is met or I mean-shift iterations have been run.

Thus, the algorithm operates similarly to K-means but finding modes instead of means: it interleaves a hard
assignment step of data points to centroids with a mode-finding step that moves each centroid to a mode of
the kde defined by the points currently assigned to it.
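
The two alternating steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function names, the stopping rule based on centroid movement, the handling of empty clusters and the toy data are all choices made for this example, and the kernel is taken to be Gaussian.

```python
import math

def sqdist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assignment_step(data, centroids):
    """Assign each point to its closest centroid (optimal R for fixed C)."""
    return [min(range(len(centroids)), key=lambda k: sqdist(x, centroids[k]))
            for x in data]

def mode_step(c, points, sigma, n_iter=1):
    """Run n_iter mean-shift updates of centroid c on the kde of its own cluster."""
    for _ in range(n_iter):
        w = [math.exp(-0.5 * sqdist(c, p) / sigma ** 2) for p in points]
        total = sum(w)  # may underflow to 0 if c is very far from its points
        c = tuple(sum(wi * p[d] for wi, p in zip(w, points)) / total
                  for d in range(len(c)))
    return c

def kmodes_fixed_sigma(data, centroids, sigma, max_outer=100, n_iter=1, tol=1e-9):
    centroids = list(centroids)
    for _ in range(max_outer):
        assign = assignment_step(data, centroids)
        new_centroids = []
        for k, c in enumerate(centroids):
            members = [x for x, a in zip(data, assign) if a == k]
            # An empty cluster keeps its centroid (a choice made for this sketch).
            new_centroids.append(mode_step(c, members, sigma, n_iter) if members else c)
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, assignment_step(data, centroids)

# Hypothetical 1D data: a group near 0.1 with one outlier at 3.0, and a group near 10.1.
data = [(0.0,), (0.1,), (0.2,), (3.0,), (10.0,), (10.1,), (10.2,)]
centroids, assign = kmodes_fixed_sigma(data, [(1.0,), (9.0,)], sigma=0.5)
```

In this toy run the outlier at 3.0 is still assigned to the first cluster, but the first centroid settles near the dense group around 0.1 instead of the cluster mean 0.825, which is where a K-means update would put it.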

Convergence of this algorithm (in value) follows from the facts that each step (over R or over C) is strictly
feasible and increases the objective or leaves it unchanged, and that the objective function is upper bounded
within the feasible set (by $N\,G(0)$, since each point contributes a single kernel term). Besides, since there is
a finite number of assignments, convergence occurs in a
finite number of outer-loop steps (as happens with K-means) if the step over C is exact and deterministic.
By this we mean that for each $c_k$ we find deterministically a maximum of its objective function (i.e., the
mode for $c_k$ is a deterministic function of R). This prevents the possibility that for the same assignment
R we find different modes for a given $c_k$, which could lead the algorithm to cycle. This condition can be
simply achieved by using an optimization algorithm that either has no user parameters (such as step sizes;
mean-shift is an example), or has user parameters set to fixed values, and running it to convergence. The
(R*, C*) convergence point is a local maximum in the sense that L(R*, C) has a local maximum at C = C*
and L(R, C*) has a global maximum at R = R*.

The computational cost per outer-loop iteration of this algorithm (setting I = 1 for simplicity in the
mean-shift step) is identical to that of K-means: the step over R is $O(KND)$ and the step over C is
$O(N_1 D + \dots + N_K D) = O(ND)$ (where $N_k$ is the number of points currently assigned to $c_k$), for a total
of $O(KND)$. And also as in K-means, the steps parallelize: over C, the mean-shift iteration proceeds
independently in each cluster; over R, each data point can be processed independently.

Homotopy Algorithm 

We start with $\sigma = \infty$ (i.e., run K-means, possibly several times and picking the best optimum). Then, we
gradually decrease $\sigma$ while running J iterations of the fixed-$\sigma$ K-modes algorithm for each value of $\sigma$, until
we reach a target value $\sigma^*$. This follows an optimum path $(R(\sigma), C(\sigma))$ for $\sigma \in [\sigma^*, \infty)$. In practice, as is
well known with homotopy techniques, this tends to find better optima than starting directly at the target
value $\sigma^*$. We use this homotopy algorithm in our experiments. Given we have to run K-means multiple
times to find a good initial optimum (as commonly done in practice), the homotopy does not add much
computation. Note that the homotopy makes K-modes a deterministic algorithm given the local optimum
found by K-means.
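
A compact sketch of the homotopy loop follows. It is an illustration only, with several assumptions that are not in the text: the names `kmodes_iteration`, `geometric_schedule` and `kmodes_homotopy` are hypothetical, the bandwidth schedule is geometric and starts from a large finite value rather than infinity, the starting centroids are assumed to come from a K-means run (here, the cluster means of the toy data), and `inner_iters` plays the role of J.

```python
import math

def sqdist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmodes_iteration(data, centroids, sigma):
    """One outer iteration: hard assignment, then one mean-shift update per centroid."""
    assign = [min(range(len(centroids)), key=lambda k: sqdist(x, centroids[k]))
              for x in data]
    updated = []
    for k, c in enumerate(centroids):
        pts = [x for x, a in zip(data, assign) if a == k]
        if not pts:  # empty cluster: keep the centroid (choice made for this sketch)
            updated.append(c)
            continue
        w = [math.exp(-0.5 * sqdist(c, p) / sigma ** 2) for p in pts]
        total = sum(w)
        updated.append(tuple(sum(wi * p[d] for wi, p in zip(w, pts)) / total
                             for d in range(len(c))))
    return updated

def geometric_schedule(sigma_start, sigma_target, steps):
    """Bandwidths decreasing geometrically from sigma_start to sigma_target."""
    ratio = (sigma_target / sigma_start) ** (1.0 / (steps - 1))
    return [sigma_start * ratio ** i for i in range(steps)]

def kmodes_homotopy(data, kmeans_centroids, sigmas, inner_iters=5):
    """Track the centroids while the bandwidth sweeps through sigmas."""
    centroids = list(kmeans_centroids)
    path = [centroids]
    for sigma in sigmas:
        for _ in range(inner_iters):
            centroids = kmodes_iteration(data, centroids, sigma)
        path.append(centroids)
    return centroids, path

# Hypothetical 1D data; the starting centroids are the means of the two groups.
data = [(0.0,), (0.1,), (0.2,), (3.0,), (10.0,), (10.1,), (10.2,)]
kmeans_centroids = [(0.825,), (10.1,)]
sigmas = geometric_schedule(3.0, 0.5, 20)
final, path = kmodes_homotopy(data, kmeans_centroids, sigmas)
```

The returned `path` holds one set of centroids per bandwidth value, which corresponds to the centroid paths that can be inspected for exploratory analysis.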

User Parameters 

The basic user parameter of K-modes is the desired number of clusters K. The target bandwidth $\sigma^*$ in
the homotopy is simply used as a scaling device to refine the centroids. We find that representative, valid
centroids are obtained for a wide range of intermediate $\sigma$ values. A good target $\sigma^*$ can be obtained with
a classical bandwidth selection criterion for kernel density estimates Wand and Jones (1994), such as the
average distance to the kth nearest neighbor.
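
The nearest-neighbor rule mentioned above can be sketched as below. This is a brute-force illustration (quadratic in the number of points), with a hypothetical function name and toy data; for large datasets a spatial index would normally replace the inner sort.

```python
import math

def knn_bandwidth(data, k):
    """Average, over all points, of the distance to the k-th nearest neighbor."""
    total = 0.0
    for i, x in enumerate(data):
        # Distances from x to every other point, in increasing order.
        dists = sorted(math.dist(x, y) for j, y in enumerate(data) if j != i)
        total += dists[k - 1]
    return total / len(data)

# Corners of the unit square: each point has two neighbors at distance 1
# and one neighbor at distance sqrt(2).
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
sigma_target = knn_bandwidth(data, k=1)
```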

Practically, a user will typically be interested in the K centroids and clusters resulting for the target
bandwidth. However, examining the centroid paths $c_k(\sigma)$ can also be interesting for exploratory analysis of
a dataset, as illustrated in our experiments with handwritten digit images.

3 Relation with Other Algorithms 

K-modes is most closely related to K-means and to Gaussian mean-shift (GMS), since it essentially
introduces the kernel density estimate into the K-means objective function. This allows K-modes to find exactly
K true modes in the data (in its mathematical sense, i.e., maxima of the kde for each cluster), while achieving
assignments as in K-means, and with a fast runtime, thus enjoying some of the best properties from both
K-means and GMS.

K-means and K-modes have the same update step for the assignments, but the update step for the
centroids is given by setting each centroid to a different location statistic of the points assigned to it: the
mean for K-means, a mode for K-modes. K-means and K-modes also define the same class of clusters
(a Voronoi tessellation, thus convex clusters), while GMS can produce possibly nonconvex, disconnected
clusters.

In GMS, the number of clusters equals the number of modes, which depends on the bandwidth $\sigma$. If one
wants to obtain exactly K modes, there are two problems. The first one is computational: since K is an
implicit, nonlinear function of $\sigma$, finding a $\sigma$ value that produces K modes requires inverting this function.
This can be achieved numerically by running mean-shift iterations while tracking $K(\sigma)$ as in scale-space
approaches Collins (2003), but this is very slow. Besides, particularly for high-dimensional data, the kde
only achieves K modes for a very narrow (even empty) interval of $\sigma$. The second problem is that even with
an optimally tuned bandwidth, a kde will usually create undesirable, spurious modes where data points are
sparse (e.g. outliers or cluster boundaries), again particularly with high-dimensional data. We avoid this
problem in the homotopy version of K-modes by starting with large $\sigma$, which tracks important modes. The
difference between K-modes and GMS is clearly seen in the particular case where we set K = 1 (as in fig. 1):
K-modes runs the mean-shift update initialized from the data mean, so as $\sigma$ decreases, this will tend to find
a single, major mode of the kde. However, the kde itself will have many modes, all of which would become
clusters under GMS.

The fundamental problem in GMS is equating modes with clusters. The true density of a cluster may 
well be multimodal to start with. Besides, in practice a kde will tend to be bumpy unless the bandwidth 
is unreasonably large, because it is by nature a sum of bumpy kernels centered at the data points. This 
is particularly so with outliers (which create small modes) or in high dimensions. There is no easy way to 
smooth out a kde (increasing the bandwidth does smooth it, but at the cost of distorting the overall density) 
or to postprocess the modes to select "good" ones. One has to live with the fact that a good kde will often 
have multiple modes per cluster. 

K-modes provides one approach to this problem, by separating the roles of cluster assignment and of
density. Each cluster has its own kde, which can be multimodal, and the homotopy algorithm tends to select
an important mode among these within each cluster. This allows K-modes to achieve good results even in
high-dimensional problems, where GMS fails.

Computationally, K-modes and K-means are $O(KND)$ per iteration for a dataset of N points in D
dimensions. While K-modes in its homotopy version will usually take more iterations, this extra runtime is
small because in practice one runs K-means multiple times from different initializations to achieve a better
optimum. GMS is $O(N^2 D)$ per iteration, which is far slower, particularly with large datasets. The reason
is that in GMS the kde involves all N points and one must run mean-shift iterations started from each of
the N points. However, in K-modes the kde for cluster k involves only the $N_k$ points assigned to it and one
must run mean-shift iterations only for the centroid $c_k$. Much work has addressed approximating GMS so
that it runs faster, and some of it could be applied to the mean-shift step in K-modes, such as using Newton
or sparse EM iterations Carreira-Perpiñán (2006).

In addition to these advantages, our experiments show that K-modes can be more robust than K-means
and GMS with outliers and with misspecification of either K or $\sigma$.

There are two variations of mean-shift that replace the local mean step of eq. (3) with a different statistic: 
the local (Tukey) median Shapira et al. (2009) and a medoid defined as any dataset point which minimizes 
a weighted sum of squared distances Sheikh et al. (2007). Both are really medoid algorithms, since they 
constrain the centroids to be data points, and do not find true modes (maxima of the density). In general,
K-medoid algorithms such as K-centers or K-medians are combinatorial problems, typically NP-hard
Hochbaum and Shmoys (1985); Kaufman and Rousseeuw (1990); Meyerson et al. (2004). In the limit $\sigma \to 0$,
K-modes can be seen as a deterministic annealing approach to a K-medoids objective (just as the elastic
net Durbin and Willshaw (1987) is for the traveling salesman problem).

There exists another algorithm called "K-modes" Huang (1998); Chaturvedi et al. (2001). This is
defined for categorical data and uses the term "mode" in a generic sense of "centroid". It is unrelated to
our algorithm, which is defined for continuous data and uses "mode" in its mathematical sense of density
maximum.

Figure 2: K-modes results for two bandwidth values using K = 2. We show the means *, their within-cluster
nearest neighbor o, the modes •, the paths followed by each mode as $\sigma$ decreases, and the contours of each
kde. Each K-modes cluster uses a different color.

Figure 3: Like fig. 2 but for the two-moons dataset. Panels: $\sigma \to \infty$ (K-means) and $\sigma = 0.1$.

4 Experiments

We compare with K-means and Gaussian mean-shift (GMS) clustering. For K-means, we run it 20 times
with different initializations and pick the one with minimum value of E in eq. (1). For K-modes, we use its
homotopy version initialized from the best K-means result and finishing at a target bandwidth (whose value
is set either by using a kde bandwidth estimation rule or by hand, depending on the experiment).



Toy Examples 

Figures 2 and 3 illustrate the three algorithms in 2D examples. They show the K modes and the kde contours
for each cluster, for $\sigma = \infty$ or equivalently K-means (left panel) and for an intermediate $\sigma$ (right panel). We
run K-modes decreasing $\sigma$ geometrically in 20 steps from 3 to 1 in fig. 2 and from 1 to 0.1 in fig. 3.

In fig. 2, which has 3 Gaussian clusters, we purposefully set K = 2 (both K-means and K-modes work
well with K = 3). This makes K-means put one of the centroids in a low-density area, where no input
patterns are found. K-modes moves the centroid inside a cluster in a maximum-density area, where many
input patterns lie, so that it is more representative.

In fig. 3, the "two-moons" dataset has two nonconvex, interleaved clusters and we set K = 2. The
"moons" cannot be perfectly separated by either K-means or K-modes, since both define Voronoi
tessellations. However, K-modes does improve the clusters over those of K-means, and as before it moves the
centroids from a region where no patterns are found to a more typical location within each cluster. Note
how, although the bandwidth used ($\sigma = 0.1$) yields a very good kde for each cluster and would also yield a
very good kde for the whole dataset, it results in multiple modes for each "moon", which means that GMS
would return around 13 clusters. In this dataset, no value of $\sigma$ results in two modes that separate the moons.

One might argue that, if a K-means centroid is not a valid pattern, one could simply replace it with the
data point from its cluster that is closest to it. While this sometimes works, as would be the case in the
rotated-digit-1 of fig. 1, it often does not: the same-cluster nearest neighbor could be a point on the cluster
boundary, therefore atypical (fig. 2) or even a point in the wrong cluster (fig. 3). K-modes will find points
interior to the clusters, with higher density and thus more typical.

Degree Distribution of a Graph 

We construct an undirected graph similar to many real-world graphs and estimate the distribution of the
degree of each vertex Newman (2010). To construct the graph, we generated a random (Erdős-Rényi) graph
(with 1 000 vertices and 9 918 edges), which has a Gaussian degree distribution, and a graph with a power-law
(long-tailed) distribution (with 3 000 vertices and 506 489 edges), and then took the union of both graphs
and added a few edges at random connecting the two subgraphs. The result is a connected graph with two
types of vertices, reminiscent of real-world networks such as the graph of web pages and their links in the
Internet. Thus, our dataset has N = 4 000 points in 1D (the degree of each vertex). As shown in fig. 4,
the degree distribution is a mixture of two distributions that are well-separated but have a very different
character: a Gaussian and a skewed, power-law distribution. The latter results in a few vertices having a
very large degree (e.g. Internet hubs), which practically appear as outliers to the far right (outside the plots).

We set K = 2. K-means obtains a wrong clustering. One centroid is far to the right, in a low-density 
(thus unrepresentative) region, and determines a cluster containing the tail of the power-law distribution; 
this is caused by the outliers. The other centroid is on the head of the power-law distribution and determines 
a cluster containing the Gaussian and the head of the power-law distribution. 

We run K-modes decreasing $\sigma$ from 200 to 1 geometrically in 40 steps. K-modes shifts the centroids to
the two principal modes of the distributions and achieves a perfect clustering. Note that the kde for the
power-law cluster has many modes, but K-modes correctly converges to the principal one.

GMS cannot separate the two distributions for any value of $\sigma$. Setting $\sigma$ small enough that the kde has
the two principal modes implies it also has many small modes in the tail because of the outliers (partly
visible in the second panel). This is a well-known problem with kernel density estimation.

Handwritten Digit Images 

We selected 100 random images (16 × 16 grayscale) from the USPS dataset for each digit 0-9. This gives a
dataset of N = 1 000 points in $[0, 1]^{256}$. We ran K-means and K-modes with K = 10, decreasing $\sigma$ from 10
to 1 geometrically in 100 steps.

Fig. 5 shows that most of the centroids for K-means are blurry images consisting of an average of digits 
of different identity and style (slant, thickness, etc.), as seen from the 20 nearest-neighbor images of each 
centroid (within its cluster). Such centroids are hard to interpret and are not valid digit images. This 
also shows how the nearest neighbor to the centroid may be an unusual or noisy input pattern that is not 
representative of anything except itself. 

K-modes unblurs the centroids as $\sigma$ decreases. The class histograms for the 20 nearest neighbors show
how the purity of each cluster improves: for K-means most histograms are widely distributed, while K-modes
concentrates the mass into mostly a single bin. This means K-modes moves the centroids onto typical regions
that both look like valid digits and are representative of their neighborhood. This can be seen not just from
the class labels, but also from the style of the digits, which becomes more homogeneous under K-modes (e.g.
see cluster $c_2$, containing digit-6 images, or $c_4$ and $c_5$, containing digit-0 images of different style).

Figure 4: Degree distribution of a graph. Left column: a histogram of the distribution, colored according to
the K-modes clustering for $\sigma = \infty$ (K-means) to $\sigma = 1$; the black vertical bar indicates the cluster boundary.
Middle column: the kde for each cluster with K-modes. Right column: the kde for the whole dataset with
GMS. The X axis is truncated to a degree of 800, so many outlying modes to the right are not shown.



Stopping K-modes at an intermediate $\sigma$ (preventing it from becoming too small) achieves just the right
amount of smoothing. It allows the centroids to look like valid digit images, but at the same time to average
out noise, unusual strokes or other idiosyncrasies of the dataset images (while not averaging digits of different
identities or different styles, as K-means does). This yields centroids that are more representative even than
individual images of the dataset. In this sense, K-modes achieves a form of intelligent denoising similar to
that of manifold denoising algorithms Wang and Carreira-Perpiñán (2010).

Note that, for K-modes, two of the centroids look very similar, which suggests one of them is redundant
(while none of the K-means centroids looked very similar to each other). Indeed, removing one of them and rerunning
K-modes with K = 9 simply reassigns nearly all data points in its cluster to that of the other, and the centroid
itself barely changes. This is likely not a coincidence. If we have a single Gaussian cluster but use K > 1, it
will be split into sectors like a pie, but in K-means the centroids will be apart from each other, while in
K-modes they will all end up near the Gaussian center, where the mode of each kde will lie. This suggests
that redundancy may be easier to detect in K-modes than in K-means.

GMS with $\sigma = 1.8369$ gives exactly 10 modes. However, one of these is a slanted-digit-1 cluster (like the
one found by K-modes) that contains 98.5% of the training set points, and the remaining 9 modes are associated with
clusters containing between 1 and 4 points only; their centroids look like digits with unusual shapes,
i.e., outliers. As noted before, GMS is sensitive to outliers, which create modes at nearly all scales. This
is particularly so with high-dimensional data, where data is always sparse, or with data lying on a
low-dimensional manifold (both of which occur here). In this case, the kde changes from a single mode for large
$\sigma$ to a multiplicity of modes over a very narrow interval of $\sigma$.

High-dimensional Datasets with Ground-Truth Labels

Finally, we report clustering statistics in datasets with known pattern class labels (which the algorithms did
not use): (1) COIL-20, which contains 32 × 32 grayscale images of 20 objects viewed from varying angles;
(2) MNIST, which contains 28 × 28 grayscale handwritten digit images (we randomly sample 200 of each
digit); and (3) the NIST Topic Detection and Tracking (TDT2) corpus, which contains on-topic documents
of different semantic categories (we removed documents appearing in more than one category and kept only
the largest 30 categories).

For each algorithm, we compare its clustering with the ground-truth one using two commonly used criteria
for evaluating clustering results: the Adjusted Rand Index and the Normalized Mutual Information Manning
et al. (2008). The results appear in table 1. For K-means, we show the best result of 20 random initializations.
K-modes was initialized from this K-means result and run by homotopy to a target bandwidth $\sigma^*$. We show
two results for K-modes, each with a different bandwidth value. The one in parentheses corresponds to
a target $\sigma^*$ estimated as the average distance of each point to its 10th nearest neighbor (a commonly
used bandwidth estimation rule). The other one (not in parentheses) corresponds to the best result over a
slightly enlarged interval of bandwidths for the homotopy. For GMS, we select $\sigma$ to
give exactly the desired K modes (which, as before, is cumbersome). The best results for each dataset are in
boldface. K-modes improves over K-means if using a bandwidth estimated automatically, but it improves
even more if searching further bandwidths. GMS gives poor results for the reasons described earlier.
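
For reference, the Adjusted Rand Index can be computed from the contingency counts of the two labelings. The sketch below uses the standard pair-counting formula and is not taken from the authors' evaluation code; the function name and the toy label vectors are assumptions of this example, and the Normalized Mutual Information is not covered here.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same points."""
    n = len(labels_true)
    # Pairs of points that share a cell in the contingency table, a row, or a column.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # value expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (sum_cells - expected) / (max_index - expected)

# The index ignores how clusters are named: swapping the labels still gives 1.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Multiplying the returned value by 100 gives a percentage scale.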



Table 1: Clustering accuracy for high-dimensional datasets (size N, dimension D, number of classes K).
N/A means our GMS code ran out of memory.

                          Adjusted Rand Index (%)            Normalized Mutual Info (%)
Dataset (N, D, K)         K-means  K-modes       GMS         K-means  K-modes       GMS
COIL (1440, 1024, 20)     56.5     62.1 (62.1)   11.6        76.8     79.1 (79.1)   49.8
MNIST (2000, 784, 10)     32.9     35.5 (34.4)   1.37        46.4     49.2 (47.4)   11.7
TDT2 (9394, 36771, 30)    55.8     56.1 (56.1)   N/A         80.0     80.8 (80.7)   N/A

Figure 5: Clustering results on USPS data with K-modes with K = 10 for $\sigma = \infty$ (i.e., K-means, top panel)
and $\sigma = 1$ (middle panel), and for GMS with $\sigma = 1.8369$ (bottom panel), which achieves K = 10 modes.
In each panel, each row corresponds to a cluster k = 1, ..., K = 10. The leftmost image shows the centroid
$c_k$ and the right 20 images are the 20 nearest neighbors to it within cluster k. The right panel shows the
histogram of class labels (color-coded) for the neighbors.

Summary

The previous experiments suggest that K-modes is more robust than K-means and GMS to outliers and
parameter misspecification (K or $\sigma$). Outliers shift centroids away from the main mass of a cluster in
K-means or create spurious modes in GMS, but K-modes is attracted to a major mode within each cluster.
GMS is sensitive to the choice of bandwidth, which determines the number of modes in the kde. However,
K-modes will return exactly K modes (one per cluster) no matter the value of the bandwidth, and whether
the kde of the whole dataset has more or fewer than K modes. K-means is sensitive to the choice of K: if
it is smaller than the true number of clusters, it may place centroids in low-density regions between clusters
(which are invalid patterns); if it is larger than the true number of clusters, multiple centroids will compete
for a cluster and partition it, yet the resulting centroids may show no indication that this happened. With
K-modes, if K is too small the centroids will move inside the mass of each cluster and become valid patterns.
If K is too large, centroids from different portions of a cluster may look similar enough that their redundancy
can be detected.

5 Discussion 

While K-modes is a generic clustering algorithm, an important use is in applications where one desires
representative centroids in the sense of being valid patterns, typical of their cluster, as described earlier. By
making $\sigma$ small enough, K-modes can always force the centroids to look like actual patterns in the training
set (thus, by definition, valid patterns). However, an individual pattern is often noisy or idiosyncratic, and
a more typical and still valid pattern should smooth out noise and idiosyncrasies, just as the idea of an
"everyman" includes features common to most men, but does not coincide with any actual man. Thus,
best results are achieved with intermediate bandwidth values: neither so large that they average widely
different patterns, nor so small that they average a single pattern, but just small enough that they average
a local subset of patterns, where the average is weighted, as given by eq. (3) but using points from a single
cluster. Then, the bandwidth can be seen as a smoothing parameter that controls the representativeness of
the centroids. Crucially, this role is separate from that of K, which sets the number of clusters, while in
mean-shift both roles are conflated, since the bandwidth determines both the smoothing and the number of
clusters.

How to determine the best bandwidth value? Intuitively, one would expect that bandwidth values
that produce good densities should also give reasonable results with K-modes. Indeed, this was the case
in our experiments using a simple bandwidth estimation rule (the average distance to the kth nearest
neighbor). In general, what "representative" means depends on the application, and K-modes offers potential
as an exploratory data analysis tool. By running the homotopy algorithm from large bandwidths to small
bandwidths (where "small" can be taken as, say, one tenth of the result from a bandwidth estimator), the
algorithm conveniently presents to the user a sequence of centroids spanning the smoothing spectrum. As
mentioned before, the computational cost of this is comparable to that of running K-means multiple times
to achieve a good optimum in the first place. Finally, in other applications, one may want to use K-modes
as a post-processing of the K-means centroids to make them more representative.

6 Conclusion and Future Work 

Our K-modes algorithm allows the user to work with a kernel density estimate of bandwidth $\sigma$ (like
mean-shift clustering) but produce exactly K clusters (like K-means). It finds centroids that are valid patterns
and lie in high-density areas (unlike K-means), are representative of their cluster and neighborhood, yet
they average out noise or idiosyncrasies that exist in individual data points. Computationally, it is slightly
slower than K-means but far faster than mean-shift. Theory and experiments suggest that it may also be
more robust to outliers and parameter misspecification than K-means and mean-shift.

Our K-modes algorithm can use a local bandwidth at each point rather than a global one, and
non-Gaussian kernels, in particular finite-support kernels (such as the Epanechnikov kernel), which may lead to a faster
algorithm. We are also working on K-modes formulations where the assignment variables are relaxed to be
continuous.

A main application for K-modes is in clustering problems where the centroids must be interpretable as
valid patterns. Beyond clustering, K-modes may also find application in problems where the data fall in a
nonconvex low-dimensional manifold, as in finding landmarks for dimensionality reduction methods de Silva
and Tenenbaum (2003), where the landmarks should lie on the data manifold; or in spectral clustering Ng
et al. (2002), where the projection of the data on the eigenspace of the graph Laplacian defines a hypersphere.